Robust named entity detection from optical character recognition output
Internal identifier: 000545 (Main/Exploration); previous: 000544; next: 000546
Authors: Krishna Subramanian [United States]; Rohit Prasad [United States]; Prem Natarajan [United States]
Source:
- International journal on document analysis and recognition (Print) [ISSN 1433-2833]; 2011.
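French descriptors
- Pascal (Inist): Reconnaissance optique caractère, Extraction information, Reconnaissance caractère, Treillis, Linguistique, Langage naturel, Caractère manuscrit, Texte, Taux fausse alarme, Confiance, Multilinguisme, Arabe, Modèle Markov caché.
- Wicri:
- topic: Linguistique, Multilinguisme.
English descriptors
- KwdEn: Arabic, Character recognition, Confidence, False alarm rate, Hidden Markov model, Information extraction, Lattice, Linguistics, Manuscript character, Multilingualism, Natural language, Optical character recognition, Text.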
Abstract
In this paper, we focus on information extraction from optical character recognition (OCR) output. Since the content from OCR inherently has many errors, we present robust algorithms for information extraction from OCR lattices instead of merely looking them up in the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, there are two problems with this approach: (1) the number of false alarms can be prohibitive for certain applications and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate the above challenges, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with optical character recognition systems for videotext in English and scanned handwritten text in Arabic.
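The contrast the abstract draws between 1-best lookup and lattice search can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lattice, the function names (`find_entity`, `one_best`, `contains`), and the choice of confidence measure (the product of edge posteriors along the matched path) are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict

# Hypothetical toy lattice: each edge is (start_node, end_node, word, posterior).
# Node ids are assumed to increase along every edge (start < end).
EDGES = [
    (0, 1, "prem", 0.6),
    (0, 1, "pram", 0.4),
    (1, 2, "natarajan", 0.3),
    (1, 2, "nataraja", 0.7),
    (2, 3, "said", 0.9),
    (2, 3, "sad", 0.1),
]


def one_best(edges):
    """Return the words on the highest-scoring path (product of posteriors)."""
    nodes = sorted({n for s, e, _, _ in edges for n in (s, e)})
    best = {nodes[0]: (1.0, [])}  # node -> (score, words so far)
    # Sorting by start node finalizes a node before its outgoing edges are used.
    for s, e, w, p in sorted(edges):
        if s in best:
            score = best[s][0] * p
            if e not in best or score > best[e][0]:
                best[e] = (score, best[s][1] + [w])
    return best[nodes[-1]][1]


def contains(words, entity):
    """1-best lookup: is the entity a contiguous subsequence of the words?"""
    n = len(entity)
    return any(words[i:i + n] == entity for i in range(len(words) - n + 1))


def find_entity(edges, entity, threshold=0.0):
    """Lattice search: return (start, end, confidence) for every path that
    spells the entity, keeping only hits at or above the threshold.
    Confidence here is an assumed measure: the product of edge posteriors."""
    out = defaultdict(list)
    for s, e, w, p in edges:
        out[s].append((e, w, p))
    hits = []

    def extend(node, i, conf, origin):
        if i == len(entity):
            if conf >= threshold:
                hits.append((origin, node, conf))
            return
        for end, word, post in out.get(node, []):
            if word == entity[i]:
                extend(end, i + 1, conf * post, origin)

    for start in list(out):
        extend(start, 0, 1.0, start)
    return hits


if __name__ == "__main__":
    entity = ["prem", "natarajan"]
    print("1-best words:", one_best(EDGES))
    print("found in 1-best:", contains(one_best(EDGES), entity))
    print("lattice hits (threshold 0.1):", find_entity(EDGES, entity, 0.1))
    print("lattice hits (threshold 0.5):", find_entity(EDGES, entity, 0.5))
```

In this toy example the entity is missing from the 1-best word sequence but is recovered from the lattice with a low confidence. Raising the threshold suppresses low-confidence hits, which reduces false alarms at the cost of also discarding some true detections.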
Affiliations:
Links to previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000122
- to stream PascalFrancis, to step Curation: 000651
- to stream PascalFrancis, to step Checkpoint: 000099
- to stream Main, to step Merge: 000551
- to stream Main, to step Curation: 000545
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Robust named entity detection from optical character recognition output</title>
<author><name sortKey="Subramanian, Krishna" sort="Subramanian, Krishna" uniqKey="Subramanian K" first="Krishna" last="Subramanian">Krishna Subramanian</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Prasad, Rohit" sort="Prasad, Rohit" uniqKey="Prasad R" first="Rohit" last="Prasad">Rohit Prasad</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Natarajan, Prem" sort="Natarajan, Prem" uniqKey="Natarajan P" first="Prem" last="Natarajan">Prem Natarajan</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">11-0343815</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 11-0343815 INIST</idno>
<idno type="RBID">Pascal:11-0343815</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000122</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000651</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000099</idno>
<idno type="wicri:doubleKey">1433-2833:2011:Subramanian K:robust:named:entity</idno>
<idno type="wicri:Area/Main/Merge">000551</idno>
<idno type="wicri:Area/Main/Curation">000545</idno>
<idno type="wicri:Area/Main/Exploration">000545</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Robust named entity detection from optical character recognition output</title>
<author><name sortKey="Subramanian, Krishna" sort="Subramanian, Krishna" uniqKey="Subramanian K" first="Krishna" last="Subramanian">Krishna Subramanian</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Prasad, Rohit" sort="Prasad, Rohit" uniqKey="Prasad R" first="Rohit" last="Prasad">Rohit Prasad</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Natarajan, Prem" sort="Natarajan, Prem" uniqKey="Natarajan P" first="Prem" last="Natarajan">Prem Natarajan</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Arabic</term>
<term>Character recognition</term>
<term>Confidence</term>
<term>False alarm rate</term>
<term>Hidden Markov model</term>
<term>Information extraction</term>
<term>Lattice</term>
<term>Linguistics</term>
<term>Manuscript character</term>
<term>Multilingualism</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance optique caractère</term>
<term>Extraction information</term>
<term>Reconnaissance caractère</term>
<term>Treillis</term>
<term>Linguistique</term>
<term>Langage naturel</term>
<term>Caractère manuscrit</term>
<term>Texte</term>
<term>Taux fausse alarme</term>
<term>Confiance</term>
<term>Multilinguisme</term>
<term>Arabe</term>
<term>Modèle Markov caché</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Linguistique</term>
<term>Multilinguisme</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In this paper, we focus on information extraction from optical character recognition (OCR) output. Since the content from OCR inherently has many errors, we present robust algorithms for information extraction from OCR lattices instead of merely looking them up in the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, there are two problems with this approach: (1) the number of false alarms can be prohibitive for certain applications and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate the above challenges, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with optical character recognition systems for videotext in English and scanned handwritten text in Arabic.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Massachusetts</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Massachusetts"><name sortKey="Subramanian, Krishna" sort="Subramanian, Krishna" uniqKey="Subramanian K" first="Krishna" last="Subramanian">Krishna Subramanian</name>
</region>
<name sortKey="Natarajan, Prem" sort="Natarajan, Prem" uniqKey="Natarajan P" first="Prem" last="Natarajan">Prem Natarajan</name>
<name sortKey="Prasad, Rohit" sort="Prasad, Rohit" uniqKey="Prasad R" first="Rohit" last="Prasad">Rohit Prasad</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000545 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000545 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:11-0343815 |texte= Robust named entity detection from optical character recognition output }}
This area was generated with Dilib version V0.6.32.